This report discusses the Human Development Index (HDI) in Indonesia. The HDI is an important indicator for assessing a country's development by examining the level of human development in each region. It covers health, income, education, and other factors within a region. We selected this data to analyze the HDI across Indonesia's regions and determine whether human development is evenly distributed.
We chose this topic because HDI provides insights into the quality of life of residents in different regions of Indonesia. By understanding the HDI of a region, we and others can identify areas needing attention to improve the well-being of the local population.
This report is intended for researchers and other parties interested in findings related to the HDI in Indonesia. The target audience includes policymakers, academics, and organizations involved in human development and social welfare.
The objectives of this analysis are to answer questions such as:
What is the state of HDI in each region?
What are the key factors influencing HDI in each region?
The methods used for the analysis are:
EDA (Exploratory Data Analysis)
Statistical methods such as analyzing correlations between variables
Data visualization to provide an overview of the HDI data
This analysis is significant because it allows others, including students like us, to understand the HDI conditions in different regions. While we may be well-off in our area, how are conditions in other regions? Is development evenly distributed or not? The main goal is to provide a clear picture of whether the HDI in a region meets the expected standards.
Our group uses the “IPM INDONESIA 2021” dataset from Kaggle (IPM, Indeks Pembangunan Manusia, is the Indonesian term for HDI). The dataset consists of 519 rows and 12 variables. Dataset link: https://www.kaggle.com/datasets/fhadhai/dataipmindonesia/data
The variables in the dataset are as follows:
In this section, we performed preprocessing on the dataset. The preprocessing steps included:
# import library
library(dplyr)
library(skimr)
library(ggplot2)
library(pander)
library(plotly)
library(corrplot)
The libraries above are used for the data preprocessing, EDA, and visualization stages.
df <- read.csv("ipm-indonesia2021-cluster.csv")
str(df)
## 'data.frame': 519 obs. of 12 variables:
## $ Provinsi : chr "ACEH" "ACEH" "ACEH" "ACEH" ...
## $ Kab.Kota : chr "Simeulue" "Aceh Singkil" "Aceh Selatan" "Aceh Tenggara" ...
## $ Persentase.Penduduk.Miskin..P0..Menurut.Kabupaten.Kota..Persen. : num 19 20.4 13.2 13.4 14.4 ...
## $ Rata.rata.Lama.Sekolah.Penduduk.15...Tahun. : num 9.48 8.68 8.88 9.67 8.21 ...
## $ Pengeluaran.per.Kapita.Disesuaikan..Ribu.Rupiah.Orang.Tahun. : int 7148 8776 8180 8030 8577 10780 9593 9644 9860 8867 ...
## $ Indeks.Pembangunan.Manusia : num 66.4 69.2 67.4 69.4 67.8 ...
## $ Umur.Harapan.Hidup..Tahun. : num 65.3 67.4 64.4 68.2 68.7 ...
## $ Persentase.rumah.tangga.yang.memiliki.akses.terhadap.sanitasi.layak : num 71.6 69.6 62.5 62.7 66.8 ...
## $ Persentase.rumah.tangga.yang.memiliki.akses.terhadap.air.minum.layak: num 87.5 78.6 79.7 86.7 83.2 ...
## $ Tingkat.Pengangguran.Terbuka : num 5.71 8.36 6.46 6.43 7.13 2.61 7.09 7.7 7.28 4.32 ...
## $ Tingkat.Partisipasi.Angkatan.Kerja : num 71.2 62.9 60.9 69.6 59.5 ...
## $ PDRB.atas.Dasar.Harga.Konstan.menurut.Pengeluaran..Rupiah. : int 1648096 1780419 4345784 3487157 8433526 5953118 7485861 10261585 7975099 10374480 ...
We rename the variables to make them easier to read.
names(df) <- c("provinsi", "kab_kota", "persentase_penduduk_miskin", "lama_sekolah", "pengeluaran", "ipm", "umur_harapan_hidup", "persentase_akses_sanitasi", "persentase_akses_air", "tingkat_pengangguran", "tingkat_partisipasi_kerja", "PDRB")
str(df)
## 'data.frame': 519 obs. of 12 variables:
## $ provinsi : chr "ACEH" "ACEH" "ACEH" "ACEH" ...
## $ kab_kota : chr "Simeulue" "Aceh Singkil" "Aceh Selatan" "Aceh Tenggara" ...
## $ persentase_penduduk_miskin: num 19 20.4 13.2 13.4 14.4 ...
## $ lama_sekolah : num 9.48 8.68 8.88 9.67 8.21 ...
## $ pengeluaran : int 7148 8776 8180 8030 8577 10780 9593 9644 9860 8867 ...
## $ ipm : num 66.4 69.2 67.4 69.4 67.8 ...
## $ umur_harapan_hidup : num 65.3 67.4 64.4 68.2 68.7 ...
## $ persentase_akses_sanitasi : num 71.6 69.6 62.5 62.7 66.8 ...
## $ persentase_akses_air : num 87.5 78.6 79.7 86.7 83.2 ...
## $ tingkat_pengangguran : num 5.71 8.36 6.46 6.43 7.13 2.61 7.09 7.7 7.28 4.32 ...
## $ tingkat_partisipasi_kerja : num 71.2 62.9 60.9 69.6 59.5 ...
## $ PDRB : int 1648096 1780419 4345784 3487157 8433526 5953118 7485861 10261585 7975099 10374480 ...
Two categorical variables (province and district) are stored as character data, which is not appropriate for categories. We convert both of them to factors.
chr_to_factor <- c("provinsi", "kab_kota")
for (col in chr_to_factor) {
df[[col]] <- as.factor(df[[col]])
}
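As a quick sanity check on the conversion above, the following is a minimal sketch that applies the same idea to a small made-up data frame and confirms the result. The data frame toy and its values are hypothetical stand-ins, not rows from the real dataset.

```r
# Toy data standing in for the real dataset (values are illustrative only)
toy <- data.frame(
  provinsi = c("ACEH", "ACEH", "BALI"),
  kab_kota = c("Simeulue", "Aceh Singkil", "Badung"),
  ipm      = c(66.4, 69.2, 81.8),
  stringsAsFactors = FALSE
)

# Convert the selected character columns to factors in one step
chr_to_factor <- c("provinsi", "kab_kota")
toy[chr_to_factor] <- lapply(toy[chr_to_factor], as.factor)

# Confirm the conversion and count the distinct categories per column
is_factor_col <- sapply(toy[chr_to_factor], is.factor)
n_levels <- sapply(toy[chr_to_factor], nlevels)
print(is_factor_col)
print(n_levels)
```

Checking the number of levels is a cheap way to spot unexpected categories, for example the same province spelled in two different ways.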
Next, we check the data for anomalies such as missing values and duplicate rows, and delete any that we find.
# Checking Missing Value
colSums(is.na(df))
## provinsi kab_kota
## 0 0
## persentase_penduduk_miskin lama_sekolah
## 5 5
## pengeluaran ipm
## 5 5
## umur_harapan_hidup persentase_akses_sanitasi
## 5 5
## persentase_akses_air tingkat_pengangguran
## 5 5
## tingkat_partisipasi_kerja PDRB
## 5 5
# Delete missing value
df <- na.omit(df)
# Delete duplicate data
df <- df[!duplicated(df), ]
# Check whether the data still contains missing values or not
colSums(is.na(df))
## provinsi kab_kota
## 0 0
## persentase_penduduk_miskin lama_sekolah
## 0 0
## pengeluaran ipm
## 0 0
## umur_harapan_hidup persentase_akses_sanitasi
## 0 0
## persentase_akses_air tingkat_pengangguran
## 0 0
## tingkat_partisipasi_kerja PDRB
## 0 0
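Before dropping rows, it can help to look at which rows are incomplete and to record how many rows each cleaning step removes. The sketch below shows one way to do this on a made-up data frame. The object toy and its values are hypothetical and only illustrate the pattern.

```r
# Toy data with one incomplete row and one exact duplicate (illustrative only)
toy <- data.frame(
  kab_kota     = c("A", "B", "C", "C", "D"),
  ipm          = c(66.4, NA, 70.1, 70.1, 68.0),
  lama_sekolah = c(9.5, NA, 8.9, 8.9, 8.2)
)

n_before <- nrow(toy)

# Inspect the rows that contain at least one missing value
incomplete_rows <- toy[!complete.cases(toy), ]
print(incomplete_rows)

# Drop rows with missing values, then drop exact duplicate rows
cleaned <- na.omit(toy)
n_after_na <- nrow(cleaned)
cleaned <- cleaned[!duplicated(cleaned), ]
n_after_dup <- nrow(cleaned)

cat("Rows dropped for missing values:", n_before - n_after_na, "\n")
cat("Rows dropped as duplicates:", n_after_na - n_after_dup, "\n")
```

Reporting these counts makes it easier for readers to judge how much of the original data remains in the analysis.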
In this section, we carry out exploratory data analysis and display data visualization.
# Summary of numeric variables
select_df <- df %>% select(-provinsi, -kab_kota)
summary_df <- summary(select_df)
pander(summary_df)
| persentase_penduduk_miskin | lama_sekolah | pengeluaran | ipm |
|---|---|---|---|
| Min. : 2.38 | Min. : 1.420 | Min. : 3976 | Min. :32.84 |
| 1st Qu.: 7.15 | 1st Qu.: 7.510 | 1st Qu.: 8574 | 1st Qu.:66.64 |
| Median :10.46 | Median : 8.305 | Median :10196 | Median :69.61 |
| Mean :12.27 | Mean : 8.437 | Mean :10325 | Mean :69.93 |
| 3rd Qu.:14.89 | 3rd Qu.: 9.338 | 3rd Qu.:11719 | 3rd Qu.:73.11 |
| Max. :41.66 | Max. :12.830 | Max. :23888 | Max. :87.18 |
| umur_harapan_hidup | persentase_akses_sanitasi | persentase_akses_air |
|---|---|---|
| Min. :55.43 | Min. : 0.00 | Min. : 0.00 |
| 1st Qu.:67.39 | 1st Qu.:70.22 | 1st Qu.: 79.04 |
| Median :69.97 | Median :81.80 | Median : 89.80 |
| Mean :69.66 | Mean :77.20 | Mean : 85.14 |
| 3rd Qu.:72.04 | 3rd Qu.:89.88 | 3rd Qu.: 96.40 |
| Max. :77.73 | Max. :99.97 | Max. :100.00 |
| tingkat_pengangguran | tingkat_partisipasi_kerja | PDRB |
|---|---|---|
| Min. : 0.000 | Min. :56.39 | Min. : 147485 |
| 1st Qu.: 3.180 | 1st Qu.:65.07 | 1st Qu.: 3654292 |
| Median : 4.565 | Median :68.95 | Median : 8814926 |
| Mean : 5.059 | Mean :69.46 | Mean : 21964077 |
| 3rd Qu.: 6.530 | 3rd Qu.:72.34 | 3rd Qu.: 19735101 |
| Max. :13.370 | Max. :97.93 | Max. :460081046 |
The table above displays a summary of the statistical measures for
each numeric column. The table provides information on the minimum
value, first quartile, median, mean, third quartile, and maximum value.
Here, we create a visualization of the 10 provinces with the highest average HDI.
# Calculate the average HDI per province
provinsi_ipm_tertinggi <- aggregate(df$ipm, by = list(df$provinsi), FUN=mean)
# Rename columns
colnames(provinsi_ipm_tertinggi)[1] ="provinsi"
colnames(provinsi_ipm_tertinggi)[2] ="ratarata"
# Make a table with the 10 provinces with the highest average HDI
top10_ipm_tertinggi <- provinsi_ipm_tertinggi[order(-provinsi_ipm_tertinggi$ratarata), ][1:10, ]
# Plotting graph bar
p <- ggplot(top10_ipm_tertinggi, aes(x = reorder(provinsi, ratarata), y = ratarata)) +
geom_bar(stat = "identity") +
labs(title = "10 provinces with the highest average HDI", x = "Provinsi", y = "Average Human Development Index") +
coord_flip()
ggplotly(p)
From the graph above, we can see the 10 provinces with the highest average HDI. Most of the provinces shown are on the island of Java, which points to a development gap between regions in Indonesia.
Here, we create a visualization of the 10 provinces with the lowest HDI average to compare with the highest.
# Calculate the average HDI per province
provinsi_ipm_terendah <- aggregate(df$ipm, by = list(df$provinsi), FUN=mean)
# Rename Columns
colnames(provinsi_ipm_terendah)[1] ="provinsi"
colnames(provinsi_ipm_terendah)[2] ="ratarata"
# Make a table with 10 provinces with the lowest average HDI
bottom10_ipm_terendah <- provinsi_ipm_terendah[order(provinsi_ipm_terendah$ratarata), ][1:10, ]
# Plotting graph bar
p <- ggplot(bottom10_ipm_terendah, aes(x = reorder(provinsi, -ratarata), y = ratarata)) +
geom_bar(stat = "identity") +
labs(title = "10 Provinces with the Lowest Average HDI", x = "Provinsi", y = "Average Human Development Index") +
coord_flip()
ggplotly(p)
The graph above shows the 10 provinces in Indonesia with the lowest average Human Development Index. Compared with the graph of the 10 highest provinces, it shows a considerable gap between the provinces with the highest and lowest HDI.
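The gap between provinces can also be expressed as a number instead of being read off two charts. The following is a minimal sketch on made-up data. The province labels P1 to P3 and all values are hypothetical, and the difference between the highest and lowest provincial averages is only one simple way to summarize the gap.

```r
# Toy district-level data (illustrative values, not from the real dataset)
toy <- data.frame(
  provinsi = c("P1", "P1", "P2", "P2", "P3"),
  ipm      = c(80, 82, 60, 64, 70)
)

# Average HDI per province, sorted from highest to lowest
avg <- aggregate(ipm ~ provinsi, data = toy, FUN = mean)
avg <- avg[order(-avg$ipm), ]
print(avg)

# Difference between the highest and lowest provincial averages
gap <- max(avg$ipm) - min(avg$ipm)
cat("Gap between highest and lowest provincial average:", gap, "\n")
```

On the real data, the same calculation would use df in place of toy.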
Scatterplot visualization of the relationship between HDI and average years of schooling
# plotting scatterplot
p <- ggplot(df, aes(x = lama_sekolah, y = ipm, color=provinsi)) +
geom_point() +
labs(title = "Relationship between HDI and Average Years of Schooling of the Population",
x = "Population School Years (Years)",
y = "Human Development Index") +
theme(plot.title = element_text(size=11))
ggplotly(p)
From this graph we can see that there is a clear positive correlation between the population’s years of schooling and the human development index. When the average length of schooling increases, the HDI also increases. This suggests that higher levels of education are associated with better human development outcomes.
The graph shows that education plays an important role in improving human development.
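The visual impression from the scatterplot can be backed up with a correlation test. The sketch below uses cor.test() from base R on a small made-up sample. The values in toy are hypothetical, so the printed numbers say nothing about the real dataset, and a correlation on its own does not establish that schooling causes higher HDI.

```r
# Toy sample of schooling years and HDI (illustrative values only)
toy <- data.frame(
  lama_sekolah = c(6.5, 7.2, 8.0, 8.8, 9.5, 10.9),
  ipm          = c(60.1, 64.3, 68.0, 70.2, 73.5, 80.4)
)

# Pearson correlation with a significance test and confidence interval
res <- cor.test(toy$lama_sekolah, toy$ipm, method = "pearson")

cat("Correlation coefficient:", round(unname(res$estimate), 3), "\n")
cat("p-value:", signif(res$p.value, 3), "\n")
cat("95% confidence interval:", round(res$conf.int, 3), "\n")
```

On the real data, the two arguments would be df$lama_sekolah and df$ipm.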
Scatterplot visualization of the relationship between per capita expenditure and HDI
# plotting scatterplot
p <- ggplot(df, aes(x = pengeluaran, y = ipm, color=provinsi)) +
geom_point() +
labs(title = "Relationship between HDI and Expenditure per Capita",
x = "Expenditure per Capita",
y = "Human Development Index") +
theme(plot.title = element_text(size=11))
ggplotly(p)
The graph shows the relationship between HDI and per capita expenditure. There is a positive relationship between the two: the higher the expenditure, the higher the HDI. This shows that higher per capita expenditure is correlated with a higher quality of life as measured by the HDI.
Scatterplot visualization of the relationship between HDI and access to adequate sanitation
# plotting scatterplot
p <- ggplot(df, aes(x = persentase_akses_sanitasi, y = ipm, color=provinsi)) +
geom_point() +
labs(title = "The relationship between HDI and Access to Adequate Sanitation",
x = "Percentage of population who have access to adequate sanitation",
y = "Human Development Index") +
theme(plot.title = element_text(size=11))
ggplotly(p)
The graph shows the relationship between HDI and the percentage of the population with access to adequate sanitation. There is a positive relationship between the two: regions where more of the population has access to adequate sanitation tend to have a higher HDI.
In this section, a statistical analysis will be conducted in the form of correlation analysis of all numerical variables.
numerical_variables <- df %>% select_if(is.numeric)
cor_matrix <- cor(numerical_variables)
corrplot(cor_matrix, method = "color", mar=c(1,1,1,1))
The poverty rate exhibits a strong negative correlation with HDI, life expectancy, and access to clean water. This suggests that as poverty decreases, these indicators tend to improve. Similarly, HDI is strongly positively correlated with life expectancy, years of schooling, and expenditure, indicating that regions with a higher HDI also tend to have higher life expectancy, more years of schooling, and higher expenditure levels.
Additionally, the unemployment rate shows a negative correlation with the labor force participation rate, suggesting that where unemployment is lower, labor participation tends to be higher. The PDRB (Gross Regional Domestic Product) has moderate to strong positive correlations with expenditure and years of schooling, indicating that regions with higher economic output also tend to have higher expenditure and education levels.
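A color-coded correlation plot is easy to scan but hard to read exact values from. The sketch below shows one way to pull the correlations with HDI out of the matrix and rank them. The data frame toy is made up for illustration, so the resulting numbers are not findings about the real dataset.

```r
# Toy numeric data (illustrative values only)
toy <- data.frame(
  ipm                        = c(60, 65, 70, 75, 80),
  lama_sekolah               = c(6, 7, 8, 9, 10.5),
  persentase_penduduk_miskin = c(30, 22, 15, 10, 4),
  tingkat_pengangguran       = c(5, 3, 6, 4, 5)
)

# Full correlation matrix, as used for the plot
cor_matrix <- cor(toy)

# Keep only the correlations with ipm, excluding ipm with itself
ipm_cor <- cor_matrix["ipm", colnames(cor_matrix) != "ipm"]

# Rank from most positive to most negative
ipm_cor_sorted <- sort(ipm_cor, decreasing = TRUE)
print(round(ipm_cor_sorted, 2))
```

On the real data, numerical_variables would replace toy. Ranking the values this way makes it explicit which variables the words "strong" and "moderate" refer to.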
Based on the analysis, it is evident that provinces with the highest HDI such as DKI Jakarta, D.I. Yogyakarta, and East Kalimantan have good access to quality education, adequate healthcare services, and developing infrastructure. Conversely, provinces with the lowest HDI like Papua, West Papua, and East Nusa Tenggara face challenges such as geographic constraints, limited resources, and high poverty rates that hinder their progress.
The factors significantly influencing HDI in each region include access to education, healthcare services, infrastructure, economic conditions, and government policies. Education and healthcare emerge as pivotal factors, where regions with robust educational facilities and healthcare services tend to exhibit higher HDI. Additionally, adequate infrastructure and strong economic conditions also contribute significantly to HDI improvement. On the other hand, remote and hard-to-reach areas generally have lower HDI due to limited access to basic services and economic opportunities.
Based on the analysis conducted:
Provinces with the highest HDI such as DKI Jakarta, D.I. Yogyakarta, East Kalimantan, and others have demonstrated good access to quality education, adequate healthcare services, and developed infrastructure. However, some provinces like Papua, West Papua, East Nusa Tenggara, and others still lack sufficient access.
Key factors influencing HDI in each region include access to education, healthcare services, infrastructure, economic conditions, and government policies. Education and healthcare are pivotal, where regions with strong educational and healthcare facilities tend to have higher HDI. Additionally, adequate infrastructure and a robust economy significantly contribute to HDI improvement. Conversely, remote and hard-to-reach areas generally have lower HDI due to limited access to basic services and economic opportunities.
To enhance HDI across Indonesia, it is recommended that the government and stakeholders prioritize investment in education and healthcare, especially in provinces with low HDI. Improving infrastructure in remote areas is also crucial to ensure better access to basic services. Furthermore, policies focusing on inclusive and equitable human development are needed to address regional disparities. With a deeper understanding of HDI and its influencing factors, development efforts can be more effective in achieving more equitable and sustainable progress across Indonesia.